PI: Dr. Vasily Yakovlev
Contact: vasily.yakovlev@vcuhealth.org
The PI is interested in exploratory analysis and would like to identify
a signature or predictive biomarker that could help predict survival.
The initial focus is a differential expression analysis of RNA-seq data
from the TCGA database, with the aim of identifying statistically
significant differences between the two groups described below. The
samples are human, from colorectal cancer patients in the TCGA database,
at two tumor locations: rectum and left colon.
This report presents the differential expression analysis of the colon
and rectum cancer datasets. STAR counts were downloaded and the analysis
was performed with the DESeq2 package. Finally, the results were
visualized with several plots, including a volcano plot and a heatmap.
Sample info:
* READ (rectum) cohort: 172 unique case submitter IDs
* COAD (colon) cohort: 461 unique case submitter IDs, further stratified into sub-localizations:
  * localization ‘descending colon’
  * localization ‘sigmoid colon’
  * localization ‘splenic flexure’
Analysis includes pre-processing of the TCGA data, Principal
Component Analysis (PCA), Differential Gene
Expression Analysis, visualizations and a survival plot for the two
cohorts of interest.
The READ (rectum) and COAD (colon) cohorts from the TCGA database were
stratified by tumor location. The PI would like to compare the “rectum”
group with the “left colon” group, which comprises the localizations
‘descending colon’, ‘sigmoid colon’, and ‘splenic flexure’. The data
consisted of gene expression quantification as STAR counts.
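The grouping described above can be sketched as a small labeling step. This is an illustration in plain Python, not the code used for the report. The record fields (`case_id`, `cohort`, `site`) and the example site strings are hypothetical, and the actual TCGA clinical field names and values may differ.

```python
# Localizations that make up the "left colon" group, as listed in the report.
LEFT_COLON_SITES = ("descending colon", "sigmoid colon", "splenic flexure")

def assign_group(cohort, site):
    """Return 'rectum', 'left_colon', or None when the sample is not in either group."""
    if cohort == "READ":
        return "rectum"
    if cohort == "COAD" and site:
        site_lower = site.strip().lower()
        # Substring match so that a label such as "Splenic flexure of colon" still qualifies.
        if any(key in site_lower for key in LEFT_COLON_SITES):
            return "left_colon"
    return None

# Hypothetical records standing in for the clinical metadata.
samples = [
    {"case_id": "case-1", "cohort": "READ", "site": "Rectum"},
    {"case_id": "case-2", "cohort": "COAD", "site": "Sigmoid colon"},
    {"case_id": "case-3", "cohort": "COAD", "site": "Splenic flexure of colon"},
    {"case_id": "case-4", "cohort": "COAD", "site": "Cecum"},
]

groups = {s["case_id"]: assign_group(s["cohort"], s["site"]) for s in samples}
kept = {case: group for case, group in groups.items() if group is not None}
print(kept)
```

Samples that fall outside both groups are dropped here rather than assigned a third label, which is one possible design choice for a two-group comparison.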
The data were pre-processed for downstream analysis, including data
wrangling and data normalization. This step included the variance
stabilizing transformation (vst) and DESeq2 normalization to ensure
data comparability across samples.
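To make the normalization step more concrete, the sketch below shows the median-of-ratios idea behind DESeq2 size factors in plain Python. It is a simplified illustration under stated assumptions, not the DESeq2 implementation and not the code used in this report. The gene names and counts are invented, and the variance stabilizing transformation is not reproduced here.

```python
import math
import statistics

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping gene -> list of raw counts, one entry per sample.
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples is zero.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for gene_counts in counts.values():
        if any(c <= 0 for c in gene_counts):
            continue
        # Log of the geometric mean of this gene across samples.
        log_geo_mean = sum(math.log(c) for c in gene_counts) / n_samples
        for j, c in enumerate(gene_counts):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # One factor per sample: the median ratio to the per-gene geometric mean.
    return [math.exp(statistics.median(r)) for r in log_ratios]

# Invented counts: sample 2 was sequenced exactly twice as deeply as sample 1.
raw = {
    "GENE_A": [10, 20],
    "GENE_B": [100, 200],
    "GENE_C": [7, 14],
    "GENE_D": [0, 3],  # skipped: contains a zero
}

factors = size_factors(raw)
normalized = {g: [c / f for c, f in zip(row, factors)] for g, row in raw.items()}
print(factors)
```

Dividing each sample's counts by its size factor puts the samples on a common scale, so that differences in sequencing depth are not mistaken for differences in expression.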
Several PCA plots were generated to visualize the variation in the
data. Samples that are similar
to each other will cluster together in the PCA plot. PCA transforms a
large set of variables into a
smaller one that still contains most of the information in the large
set. It does this by identifying
the directions (principal components) in which the data varies the most.
The data was first normalized (vst)
so that the variance becomes independent of the mean. The axes of a PCA
plot represent the principal
components. Typically, the first two principal components (PC1 and PC2)
are plotted, as they capture the
most variance. A 3D PCA plot was also produced to visualize the data
along the third principal component (PC3).
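As a rough illustration of how principal components and their share of variance are obtained, here is a minimal sketch in plain Python. It uses power iteration on the covariance matrix of a tiny invented matrix (rows are samples, columns are genes) standing in for vst-transformed values. This is not the code behind the report's PCA plots.

```python
import math

def center(rows):
    """Subtract each gene's (column's) mean so PCA measures variation around it."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Gene-by-gene sample covariance matrix (divides by n - 1)."""
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def leading_component(cov, iters=500):
    """Power iteration: returns (variance, unit direction) of the top component."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = mat_vec(cov, v)
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return dot(v, mat_vec(cov, v)), v

def deflate(cov, variance, v):
    """Remove a found component so the next power iteration finds the following one."""
    p = len(cov)
    return [[cov[i][j] - variance * v[i] * v[j] for j in range(p)] for i in range(p)]

# Invented values: samples 1-2 and samples 3-4 differ mainly in the first gene.
data = [
    [10.0, 5.0, 1.0],
    [11.0, 5.5, 1.2],
    [2.0, 5.2, 0.9],
    [1.0, 4.8, 1.1],
]

centered = center(data)
cov = covariance(centered)
total_variance = sum(cov[i][i] for i in range(len(cov)))

var1, pc1 = leading_component(cov)
var2, pc2 = leading_component(deflate(cov, var1, pc1))

# Coordinates of each sample on the PC1 and PC2 axes of a PCA plot.
scores_pc1 = [dot(row, pc1) for row in centered]
scores_pc2 = [dot(row, pc2) for row in centered]

frac1 = var1 / total_variance
frac2 = var2 / total_variance
print(f"PC1 explains {frac1:.1%} of the variance, PC2 explains {frac2:.1%}")
```

In this toy example the two pairs of samples end up on opposite sides of PC1, which is the kind of separation one looks for between groups in a PCA plot.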
A volcano plot was generated to visualize the differential expression
results. This plot displays
-> Log2 Fold Change (x-axis): Indicates the magnitude of expression change between colon and rectum samples.
-> -Log10 Adjusted p-value (y-axis): Represents the statistical significance of the expression change.
-> Significance Thresholds: Horizontal and vertical lines on the plot indicate thresholds for significance and fold change, helping to highlight the most differentially expressed genes.
A heatmap was created to show the expression levels of the top
differentially expressed genes. Key features of the heatmap include
-> Expression Patterns: It visualizes the expression patterns of these genes across samples, facilitating the identification of clusters or patterns in gene expression.
-> Clustering: Both genes and samples are clustered to reveal similarities and differences in expression profiles.
This R Markdown file contains code for analyzing TCGA data, including
PCA, differential gene expression analysis, and other bioinformatics
visualizations. The code provided is released under the MIT License and
is intended for use in research and educational projects.
MIT License
Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and
associated documentation files (the “Software”), to deal in the Software
without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the
following conditions:
The above copyright notice and this permission notice shall be
included in all copies or substantial
portions of the Software.
Disclaimer:
The software is provided “as is”, without warranty of any kind, express
or implied, including
but not limited to the warranties of merchantability, fitness for a
particular purpose, and
non-infringement. In no event shall the authors or copyright holders be
liable for any claim,
damages, or other liability, whether in an action of contract, tort, or
otherwise, arising from, out of,
or in connection with the software or the use or other dealings in the
software.
Publication Policy
As part of our commitment to transparency and scientific
collaboration, our bioinformatics services core
releases code and methods upon project completion. For generated code,
we maintain a private GitHub
repository during project execution where investigators and
collaborating students can contribute.
Methods are written throughout the project lifecycle and are part of our
core’s deliverables. Upon
publication, the repository becomes public and is released under an
open-source license,
ensuring that others can build upon and benefit from our work, with the
exception of code handling sensitive data.
Our core adheres to strict data security practices. Any code handling
sensitive or confidential data undergoes
rigorous review to ensure compliance with privacy regulations and may or
may not be publicly
released according to VCU’s data security and privacy policies.
We require that any results obtained from code generated during our
collaboration include a citation to the
GitHub repository, acknowledging the contributions of our analysts.
Additionally, the BISR and its source of
funding, the CCSG grant, must be included in the acknowledgment and
funding sections of manuscripts.
* Colon vs. Rectum (Control level: Rectum)
* Descending Colon vs. Rectum (Control level: Rectum)
* Sigmoid Colon vs. Rectum (Control level: Rectum)
* Splenic Flexure of Colon vs. Rectum (Control level: Rectum)
Comparison 1 Treatment: Colon vs. Control: Rectum
Comparison 2 Treatment: Descending Colon vs. Control: Rectum
Comparison 3 Treatment: Sigmoid Colon vs. Control: Rectum
Comparison 4 Treatment: Splenic Flexure of Colon vs. Control: Rectum
Expectation: We would expect to see samples that are similar to each other cluster together.
Comparison 1 Treatment: Colon vs. Control: Rectum
Comparison 2 Treatment: Descending Colon vs. Control: Rectum
Comparison 3 Treatment: Sigmoid Colon vs. Control: Rectum
Comparison 4 Treatment: Splenic Flexure of Colon vs. Control: Rectum
Volcano: Each dot represents a gene. The x-axis shows the log2 fold change in expression of the treatment relative to the control, and the y-axis shows -log10(padj). The red line marks the significance threshold (p-value < 0.05); every point (gene) above that line shows a statistically significant change between the two conditions.
Vertical lines mark a 1.5-fold change. Genes highlighted in RED are up-regulated in the treatment compared to the control. Genes highlighted in BLUE are down-regulated in the treatment compared to the control. Genes in gray do not meet both the logFC and p-value thresholds.
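The coloring rule described above can be written as a small classification function. The sketch below is plain Python for illustration only. It assumes that the 0.05 cutoff is applied to the adjusted p-value and that both comparisons are strict inequalities, neither of which is stated explicitly in the report. The gene names and values are invented.

```python
import math

PADJ_CUTOFF = 0.05            # significance threshold (assumed to apply to padj)
LFC_CUTOFF = math.log2(1.5)   # a 1.5-fold change expressed on the log2 scale

def classify(log2fc, padj):
    """Return the volcano color for one gene: 'red', 'blue', or 'gray'."""
    if padj is None or math.isnan(padj):
        return "gray"  # genes without an adjusted p-value cannot pass the cutoff
    if padj < PADJ_CUTOFF and log2fc > LFC_CUTOFF:
        return "red"   # up-regulated in the treatment
    if padj < PADJ_CUTOFF and log2fc < -LFC_CUTOFF:
        return "blue"  # down-regulated in the treatment
    return "gray"

def volcano_point(log2fc, padj):
    """Coordinates of a gene on the plot: (log2 fold change, -log10 padj)."""
    # A floor avoids log10(0) when padj underflows to zero.
    return (log2fc, -math.log10(max(padj, 1e-300)))

# Invented results standing in for a differential expression table.
results = {
    "GENE_A": (1.2, 0.001),
    "GENE_B": (-0.9, 0.01),
    "GENE_C": (0.3, 0.001),   # significant, but below the fold-change cutoff
    "GENE_D": (2.0, 0.2),     # large change, but not significant
}

colors = {gene: classify(lfc, padj) for gene, (lfc, padj) in results.items()}
print(colors)
```

A gene is colored only when it passes both cutoffs, so a large fold change with a weak p-value, or a small fold change with a strong one, stays gray.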
Comparison 1 Treatment: Colon vs. Control: Rectum
Comparison 2 Treatment: Descending Colon vs. Control: Rectum
Comparison 3 Treatment: Sigmoid Colon vs. Control: Rectum
Comparison 4 Treatment: Splenic Flexure of Colon vs. Control: Rectum
A heatmap of z-score-normalized read counts for each comparison. The x-axis shows samples and the y-axis shows genes. Red represents the number of standard deviations above the mean for each read count (i.e., higher expression), and blue the number of standard deviations below the mean (i.e., lower expression). White indicates that a read count is close to the mean. The dendrograms cluster the samples and the genes by expression.
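The z-score scaling behind the heatmap colors can be sketched as follows. This is plain Python for illustration, assuming that each gene (row) is scaled across samples using the sample standard deviation. The gene names and values are invented, and this is not the code used to draw the report's heatmaps.

```python
import statistics

def row_zscores(values):
    """Scale one gene's values across samples to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    if sd == 0:
        # A gene with identical values in every sample carries no contrast.
        return [0.0] * len(values)
    return [(v - mean) / sd for v in values]

# Invented normalized expression values: one row per gene, one column per sample.
expression = {
    "GENE_A": [1.0, 2.0, 3.0],
    "GENE_B": [50.0, 10.0, 30.0],
    "GENE_C": [4.0, 4.0, 4.0],
}

zscores = {gene: row_zscores(row) for gene, row in expression.items()}
for gene, row in zscores.items():
    print(gene, [round(z, 2) for z in row])
```

After scaling, positive values (drawn in red) are samples where a gene is above its own mean and negative values (drawn in blue) are below it, so genes with very different absolute expression levels can share one color scale.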
| Package | Version | Citation |
|---|---|---|
| AnnotationDbi | 1.66.0 | @Annotat…. |
| base | 4.4.1 | @base |
| ComplexHeatmap | 2.20.0 | @Complex…. |
| DESeq2 | 1.44.0 | @DESeq2 |
| DT | 0.33 | @DT |
| enrichplot | 1.24.2 | @enrichplot |
| ggrepel | 0.9.5 | @ggrepel |
| gplots | 3.1.3.1 | @gplots |
| here | 1.0.1 | @here |
| janitor | 2.2.0 | @janitor |
| knitr | 1.48 | @knitr20…. |
| org.Hs.eg.db | 3.19.1 | @orgHsegdb |
| pacman | 0.5.1 | @pacman |
| plotly | 4.10.4 | @plotly |
| RColorBrewer | 1.1.3 | @RColorB…. |
| reticulate | 1.38.0 | @reticulate |
| rmarkdown | 2.27 | @rmarkdo…. |
| RNASeqBits | 0.1.0 | @RNASeqBits |
| scales | 1.3.0 | @scales |
| TCGAbiolinks | 2.32.0 | @TCGAbio…. |
| tidyverse | 2.0.0 | @tidyverse |